P-mode Size Optimization
Introduction
Here are a few tricks and code snippets which will help make your p-mode code that little bit more compact. No doubt you've choosen p-mode because you want to use vast amounts of memory and/or lots of nice 32-bit instructions. At first this may seem like a brilliant idea, no need for lots of segment juggling or segment overrides.. but now you've got the problem of trying to squeeze all those 32-bit immediate values and 32-bit addresses into your nice, new 4-Kb intro...
To FLAT or not to FLAT?
Okay, you've probably all heard of "FLAT-mode" (or "FLAT memory-model") many times before and no doubt have used it for lots of things, but is FLAT mode the best solution for small intros?
Sure, all memory references start from linear-address 00000000 hex, so there is no need to perform the various calculations to convert 32-bit FLAT memory pointers into 32-bit relative offsets for your CODE and DATA segments. E.g.:
DATA segment base address = 01234567 hex
FLAT memory address = 89ABCDEF hex
Relative offset = FLAT memory address - Data segment base
The simple conversion from FLAT --> 32-bit offset for the above example gives:
Relative offset = 89ABCDEF hex - 01234567 hex
= 88888888 hex
But this only needs to be done for every reference outside your main CODE/DATA/STACK segment (such as allocated or system memory).
Hmmm.. Sounds like more hard work..
Wot, no relocation info?
The main advantage of using a segmented memory-model (where the CODE, DATA, STACK segments which do NOT start at 00000000 hex) is that you don't need to store all the relocation information. The relocation information is needed for FLAT mode to translate each 32-bit address into it's actual memory load address. This relocation is needed because your program can not know where it will be loaded in memory (the Operating-System will simply allocate the required amount of memory, SOMEWHERE in memory depending on your PC's configuration and amount of memory already used).
address. opcode mnemonics
ÄÄÄÄÄÄÄÄÄ ÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ ÄÄÄÄÄÄÄÄÄ
000010000 A1 xx xx xx xx mov eax, [xxxxxxxx]
000010005 90 nop
000010006 0F B6 38 movzx edi, byte ptr [eax]
000010009 29 3E xx xx xx xx sub [xxxxxxxx], edi
000010010 90 nop
In FLAT mode the above instructions the 32-bit addresses (xxxxxxxx) would need to be relocated at load time by either the Operating-System, or your own loader. You need to relocate the dword at 00010001 hex and the dword at 00001000B hex. In order to do this some extra data needs to be stored specifying which addresses in your program needs to be relocated (adjusted) by the loader.
For a segmented memory model you don't need to store this relocation info, and you also don't need to include the code to perform the relocation operation. The only relocation needed is to set the 32-bit Base address in each segment's GDT or LDT descriptor.
Pushing the immediates
There is another advantage from using the segmented-mode. You can arrange your most frequently used variables at [DS:000000xx] hex which means you can use 8-bit displacements instead of 32-bit addresses for some operations.
Instead of this:
BF xx xx xx xx lea edi, xxxxxxxx
Try this:
6A xx push offset 000000xx
5F pop edi
This saves a grand total of 2 bytes. I know, it's not much, but the above PUSH/POP trick can be used for another purpose; loading a small immediate values into a 32-bit register.
B8 FF FF FF FE mov eax, -2 (5 bytes)
6A FE push -2
58 pop eax (3 bytes)
To load the value -1 (FFFFFFFF hex) you can choose one of the following methods:
5 bytes B8 FF FF FF FF mov eax, -1
3 bytes 2B C0 sub eax, eax (flags)
48 dec eax (flags)
3 bytes 83 C8 FF or eax, -1 (flags)
3 bytes 6A FF push -1
58 pop eax (no flags)
You can also use the 67 hex address size prefix to help reduce your code size.
5 bytes A1 xx xx xx xx mov eax, [xxxxxxxx]
4 bytes 67 A1 xx xx mov eax, [xxxx]
Using a 16-bit address allows a short address of 0000...FFFF hex.
Reading from memory
Okay, now lets look at the other registers and those instructions which are 1 byte bigger than an EAX version.
6 bytes 8B 35 xx 00 00 00 mov esi, [000000xx]
5 bytes 6A xx push offset 000000xx
5E pop esi
8B 36 mov esi, [esi]
As you can see, it has only saved 1 byte. Combined with other local memory operations the [ESI] base could help save more bytes (i.e. MOV ECX,[ESI+4]). Of course if you want to read or write AL/eAX then the above PUSH xx method could save another byte.
4 bytes 6A xx push offset 000000xx
5E pop esi
AD lodsd
Combined with a number of XCHG EAX,reg32 operations this is a short way to load lots of registers from memory.
18 bytes 8B 0D xx xx xx xx mov ecx, [000000xx]
8B 15 xx xx xx xx mov edx, [000000xx]
8B 1D xx xx xx xx mov ebx, [000000xx]
Smaller:
9 bytes 6A xx push offset 000000xx
5E pop esi
AD lodsd
91 xchg eax, ecx
AD lodsd
92 xchg eax, edx
AD lodsd
93 xchg eax, ebx
If you arrange your variables and tables/buffers carefully enough the you could probably use the final [DS:ESI] to point to the start of a buffer/table.
Indexing memory using pointer & count
Suppose we have a 2 DWORDs in memory, the first is 'Tbase' which holds the start (base) address of a table, and 'Telement' which is the element number within the table we want to access. The 'Telement' must be scaled up by 4 to obtain the desired DWORD within the table.
16 bytes A1 xx xx xx xx mov eax, [Telement]
C1 E0 02 shl eax, 2
03 05 xx xx xx xx add eax, [Tbase]
8B 00 mov eax, [eax]
Using S-I-B addressing mode:
14 bytes A1 xx xx xx xx mov eax, [Telement]
8B 1D xx xx xx xx mov ebx, [Tbase]
8B 04 83 mov eax, [eax*4+ebx]
Smaller:
12 bytes 8E xx xx xx xx lea esi, Telement
AD lodsd
91 xchg eax, ecx ; ECX=Telement
AD lodsd
96 xchg eax, esi ; ESI=Tbase
F3 AD rep lodsd ; ESI+ECX*4 :)
AD lodsd
Or if in a PUSH xx range:
10 bytes 6A xx push 000000xx
5E pop esi
AD lodsd
91 xchg eax, ecx ; ECX=Telement
AD lodsd
96 xchg eax, esi ; ESI=Tbase
F3 AD rep lodsd
AD lodsd
This requires the Telement and Tbase to be stored in memory in that order and obviously messes up ESI and ECX registers. :(
Hacking the stack
I would bet that during your many coding sessions there have been times when you have simply run out of registers, especially when trying to combine multiple data-streams. Normally once the [ES:EDI] register pair have been used you are forced to use a "MOV [reg32],src + ADD reg32, xx" combination. Well, there is an ancient trick which was popular on 8-bit machines like the old ZX Speccy and C64, that of using the stack to build up a table.
On my favourite cpu, the 68000, there were 2 modes and 2 stack-pointers which were mapped into the A7 address-register. For normal use, the 68000 was in USER-mode so A7 = USP (user stack-pointer), now if an interrupt occurred the cpu switched to SUPERVISOR-mode so A7 = SSP (Supervisor stack pointer). Once the interrupt handler(s) had finished the cpu switched back to USER-mode.
Yeah, very interesting TAD, but so what?
Because only the SUPERVISOR mode stack (SSP) was used to store the SR and PC registers during an interrupt or OS 'TRAP #x' call, you could be sure that data BELOW the stack-pointer was not corrupted. This is NOT the case with the 80x86, even with Intel's fancy task switching, gates and exceptions you can NOT expect data below (e)SP to remain unchanged.
Try this for yourself:
.model tiny ; Assemble as a .COM proggy !!
.code
org 256
go:
mov ax, 55AAh
push ax
mov bp, sp
pop ax
again:
cmp ax, [bp]
jz short again
ret
end go
The above code pushes 55AA hex onto the stack, pops it and waits until [SP-2] has changed. (Yes, it does change after the first interrupt.)
This means you can only build a stable table in REVERSE order (from the end to the start). Anything above the (e)SP is safe. So you use PUSH to build up a large table in a far smaller way than using MOV eAX + STOSx.
9 bytes 8A C3 mov al, bl
AA stosb
B0 0D mov al, 0Dh
AA stosb
B0 0A mov al, 0Ah
AA stosb
Stack method:
8 bytes 6A 0A push 0Ah
6A 0D push 0Dh
4C dec esp
88 1C 24 mov [esp], bl
The advantages are clear; PUSH xx can be used to replace MOV AL,x + STOSB pairs, ESP always points to the start of the buffer (unlike EDI when DF=0) and the AL register is freed up.
Of course, it won't work inside sub-routine/procedures and you need to clean up the stack once you have finished with the data... unless...
You don't have to stick with the default stack buffer, you can change ESP as much as you wish, just remember to allow some extra workspace beyond the start of the destination buffer.
Here is a stoopid example:
lea edi, source ; some input
mov [saveESP], esp ; save the ESP
lea esp, buffer+400 ; end of buffer
sub edi, esp ; \
sub edi, 4 ; / adjust EDI offset
mov ecx, 100 ; 100 dwords
again:
mov eax, [esp+edi] ; get something..
not eax
push eax ; store it in reverse !
loop again
mov esp, [saveESP] ; restore ESP
.DATA?
saveESP dd ?
dd 1024 dup (?) ; extra 4K of ESP workspace
buffer dd 100 dup (?) ; the actual buffer
Simple (and small) memory-alloc
This is a translation of my 7 byte Real/v86-mode memory allocate routine which I presented in Wilby 4 (the nice 4Kb diskmag). The idea is simple, use the stack-pointer to allocate memory blocks.
memalloc:
pop eax ; get (EAX)=return-address
memclear:
push 00h
loop memclear ; clear & alloc (ECX) DWORDs
jmp eax ; return back to caller
You pass the number of DWORDs in (ECX) and it returns the base address of the 'allocated' memory in the (ESP) register.
mov ecx, (4096/4) ; allocate 4Kb
call memalloc
lea edi, [esp] ; [DS:EDI] -->; 4096 byte buffer
mov ecx, (50000/4) ; allocate 50000 bytes
call memalloc
lea esi, [esp] ; [DS:ESI] -->; 50000 byte buffer
There are some things to remember about this method:
(1) Remember to give your program a BIG stack with enough memory for all your requirements.
(2) You can only de-alloc memory in REVERSE order!! (In the above example this means ADD ESP,50000 then ADD ESP,4096.)
Cleaning up the stack
If you are using the stack to pass/return parameters then you normally have to clean up the ESP register after each call.
push xx
push xx
call MyProc
add esp, 8
push xx
push xx
call MyProc
add esp, 8
But you don't have to. You can clean up after a number of calls or even reload the ESP from within your main loop.
push xx
push xx
call MyProc
push xx
push xx
call Myproc
add esp, 8+8 ; saves 3 bytes
If some of the parameters are identical for each call then you reuse some of the parameters for a number of calls.
push xx
push yy
call MyProc
pop eax ; throw away parameter yy
push zz ; replace it with zz
call Myproc
add esp, 8
In the above we call 'Myproc' with parameters xx, yy and then with xx,zz.
PUSHA and POPA procedures
With the 80386+ [ESP] addressing modes together with PUSH xx, PUSHAD and POPDA, passing parameters on the stack becomes very easy, and also very small in terms of code size. These days complex, highly reusable procedures are widely used in small demos/intros. The sheer speed of the current 'standard' CPU makes unrolling loops or using macros to build huge graphical/mathematical almost pointless. Instead of those old, bloated binary offerings, more flexible procedures are desired which often need a large number of parameters. In order to cope with them, high-level/compiler stack based techniques can be used.
Let's use an 'DrawRectangle' for a simple example...
mov al, 7 ; color
mov ebx, 50 ; x
mov ecx, 100 ; y
mov edx, 128 ; width
mov esi, 90 ; height
call DrawRectangle
...
DrawRectangle:
pushad ; this allows us to use ALL the
; registers within this procedure..
mov edi, [esp+18h] ; 'ECX' from PUSHA structure
imul edi, 320
add edi, [esp+10h] ; + 'EBX' (x)
add edi, 000A0000h
mov edx, [esp+04h] ; 'ESI' (height)
@@1:
mov ecx, [esp+14h] ; 'ECX' (width)
rep stosb
sub edi, [esp+14h] ; \ next destination line
add di, 320 ; /
dec edx
jnz @@1
popad
The above (pointless) procedure demonstrates how to use the PUSHAD and POPAD instructions in order to preserve registers and also how to access these stacked input parameters. Just in case you're not familiar with PUSHAD, this is what the stack looks like after a PUSHAD instruction:
ESP --> EDI [ESP+00h]
ESI [ESP+04h]
EBP [ESP+08h]
ESP [ESP+0Ch] ** value ignored by POPAD **
EBX [ESP+10h]
EDX [ESP+14h]
ECX [ESP+18h]
EAX [ESP+1Ch]
... [ESP+20h]
The beauty of PUSHAD and POPAD is that they are only 1-byte opcodes, so are ideal for other optimization tricks. Of course you don't have to use registers to pass parameters, you can push values directly onto the stack. This means that you can use the 2-byte PUSH xx instruction, and it allows parameters to be passed from ANY register or memory-location. In short this means far less register-juggling!!
push 7 ; color 7
push 50 ; x1
push [ycoord] ; y1
push 300 ; x2
call HorzLine
...
HorzLine:
pushad
mov edi, 320
imul edi, [esp+32+4+04h]
add edi, [esp+32+4+08h]
add edi, 000A0000h
mov ecx, [esp+32+4+00h]
sub ecx, [esp+32+4+08h]
mov al, [esp+32+4+0Ch]
rep movsb
popad
ret 16 ; adjust stack (4x PUSHes)
Local work-space.
Now suppose in your procedure you need to allocate some space for local variables/workspace. The obvious solution would be something like:
MyProc:
60 pushad
83 EC 14 sub esp, 20 ; make room for 20 bytes
... do stuff ...
83 C4 14 add esp, 20 ; trash local 20 bytes...
61 popad
ret
In the above case we only need 20 bytes of work-space and on average it's usually less than 32 bytes, so here is a little trick:
MyProc:
60 pushad
60 pushad ; make room for 32 bytes
... do stuff ...
61 popad ; trash local 32 bytes...
61 popad
ret
As you can see, we have saved 4 bytes by using the PUSHAD to decrease ESP by 32 bytes instead of using the normal SUB ESP,xx and ADD ESP,xx instructions. This can lead on to another nice trick...
Suppose we pass parameters in registers (like in an earlier example), those parameters have been auto-duplicated on the stack by the 2nd PUSHAD!! This means we could screw around with those duplicated values without having to restore them..
mov al, 0
mov ecx, 64000
mov edi, 000A0000h
call Stoopid
...
Stoopid:
pushad ; store all registers
pushad ; make a 2nd-copy (local vars.)
@@1:
stosb
dec dword ptr [esp+18h]
jnz short @@1
popad ; trash 32-byte work-space vars.
popad ; restore all registers
I know its a completely stoopid example, but hopefully you can see how useful this '2nd-copy PUSHAD' method could be... It removes the need to copy passed register values into local variables. Of course the above code is best done using: REP STOSB, but if your procedure has no free registers left then perhaps the '2nd-copy PUSHAD' trick might be worthwhile. ;)
Looping the loop.
Ah, the coder's favourite item, the loop.
B9 000000xx mov ecx, 000000xx
again:
.. ..
E2 xx loop again (7 bytes)
Better:
6A xx push xx
59 pop ecx
again:
.. ..
E2 xx loop again (5 bytes)
If you're running short of registers, why not use the nice [ESP+nn] addressing mode..
6A xx push xx
again:
.. ..
FF 0C 24 dec dword ptr [esp]
75 xx jnz again (7 bytes)
58 pop eax (clean up)
As a bonus, you can load any register with 00000000 hex during the final stack clean-up operation (the "pop eax").
Toggling 16-bit values
If you are using 16-bit values and want a quick way to toggle between two different values, then try using the BSWAP instruction.
BB xx xx xx xx mov ebx, xxxxxxxx
swap: 66 81 F3 xx xx xor bx, xxxx (10 bytes)
BB xx xx xx xx mov ebx, 12345678h
swap: C1 CB 10 ror ebx, 16 (8 bytes)
BB xx xx xx xx mov ebx, 34125678h ** NOTE **
swap: 0F CB bswap ebx (7 bytes)
The advantage of BSWAP is that it doesn't change the flags and is 1 byte shorter. But be careful of the byte ordering!! The bytes in the top word MUST be reversed for the BSWAP to work (3412 hex instead of 1234 hex!!).
If the toggle value is small enough for an 8-bit value then you can combine the above with the PUSH+POP load method.
6A xx push 000000xx
69 pop ecx
swap: 0F C9 bswap ecx (5 bytes)
This means you can toggle between 00 and some small values 00xx. Combined with a prefix override this 'might' be useful for the string instructions using the CX register.
You can also toggle between FFFF hex and FF80...FFFF hex by pushing the values -128 to -1.
Final thought.
The tricks presented above 'might' make your code even bigger. It depends if you use some kinda executable compression. In which case you may want to unroll small loops and try to forget about what you have just read. ;) To help your compressor, you might have to un-optimized your code to make it more 'phraze-match-friendly'.
But if your code is so tightly optimized by hand that no compressor can make your code smaller (yes, this can happen) then perhaps this article could help you chew a few bytes off your production to reach the magic X byte limit.
Happy counting...